{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Overlap Similarity\n",
    "----\n",
    "\n",
    "In this notebook we will explore the Overlap Coefficient and compare it again Jaccard.  Similarity can be between neighboring vertices (default) or second hop neighbors\n",
    "\n",
    "\n",
    "Notebook Credits\n",
    "\n",
    "    Original Authors: Brad Rees\n",
    "    Created:   10/14/2019\n",
    "    Last Edit: 05/08/2020\n",
    "\n",
    "RAPIDS Versions: 0.12.0a\n",
    "\n",
    "Test Hardware\n",
    "* GV100 32G, CUDA 10.0\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction - Common Neighbor Similarity \n",
    "\n",
    "One of the most common types of vertex similarity is to evaluate the neighborhood of vertex pairs and looks at the number of common neighbors.  TThat type of similar comes from statistics and is based on set comparison.  Both Jaccard and the Overlap Coefficient operate on sets, and in a graph setting, those sets are the list of neighboring vertices. <br>\n",
    "For those that like math:  The neighbors of a vertex, _v_, is defined as the set, _U_, of vertices connected by way of an edge to vertex v, or _N(v) = {U} where v ∈ V and ∀ u ∈ U ∃ edge(v,u)∈ E_.\n",
    "\n",
    "For the rest of this introduction, set __A__ will equate to _A = N(i)_ and set __B__ will quate to _B = N(j)_.  That just make the rest of the text more readable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Overlap Coefficient"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Overlap Coefficient between two sets is defined as the ratio of the volume of their intersection divided by the volume of the smaller set.\n",
    "The Overlap Coefficient can be defined as\n",
    "\n",
    "<a href=\"https://www.codecogs.com/eqnedit.php?latex=oc(A,B)&space;=&space;\\frac{|A|&space;\\cap&space;|B|}{min(|A|,&space;|B|)&space;}\" target=\"_blank\"><img src=\"https://latex.codecogs.com/gif.latex?oc(A,B)&space;=&space;\\frac{|A&space;\\cap&space;B|}{min(|A|,&space;|B|)&space;}\" title=\"oc(A,B) = \\frac{|A \\cap B|}{min(|A|, |B|) }\" /></a>\n",
    "\n",
    "To compute the Overlap Coefficient between all pairs of vertices connected by an edge in cuGraph use: <br>\n",
    "\n",
    "__df = cugraph.overlap(G)__\n",
    "\n",
    "    G: A cugraph.Graph object\n",
    "\n",
    "Returns:\n",
    "\n",
    "    df: cudf.DataFrame with three names columns:\n",
    "        df[\"source\"]: The source vertex id.\n",
    "        df[\"destination\"]: The destination vertex id.\n",
    "        df[\"overlap_coeff\"]: The overlap coefficient computed between the source and destination vertex.\n",
    "\n",
    "__References__\n",
    "- https://en.wikipedia.org/wiki/Overlap_coefficient\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Refresh on Jaccard\n",
    "The Jaccard similarity between two sets is defined as the ratio of the volume of their intersection divided by the volume of their union. \n",
    "\n",
    "The Jaccard Similarity can then be defined as\n",
    "\n",
    "<a href=\"https://www.codecogs.com/eqnedit.php?latex=js(A,B)&space;=&space;\\frac{|A&space;\\cap&space;B|}{|A&space;\\cup&space;B&space;|&space;}&space;=&space;\\frac{|A&space;\\cap&space;B|}{&space;|A|&space;&plus;&space;|B|&space;-&space;|A&space;\\cup&space;B&space;|&space;}\" target=\"_blank\"><img src=\"https://latex.codecogs.com/gif.latex?js(A,B)&space;=&space;\\frac{|A&space;\\cap&space;B|}{|A&space;\\cup&space;B&space;|&space;}&space;=&space;\\frac{|A&space;\\cap&space;B|}{&space;|A|&space;&plus;&space;|B|&space;-&space;|A&space;\\cup&space;B&space;|&space;}\" title=\"js(A,B) = \\frac{|A \\cap B|}{|A \\cup B | } = \\frac{|A \\cap B|}{ |A| + |B| - |A \\cup B | }\" /></a>\n",
    "\n",
    "\n",
    "To compute the Jaccard similarity between all pairs of vertices connected by an edge in cuGraph use: <br>\n",
    "__df = cugraph.jaccard(G)__\n",
    "\n",
    "    G: A cugraph.Graph object\n",
    "\n",
    "Returns:\n",
    "\n",
    "    df: cudf.DataFrame with three names columns:\n",
    "        df[\"source\"]: The source vertex id.\n",
    "        df[\"destination\"]: The destination vertex id.\n",
    "        df[\"jaccard_coeff\"]: The jaccard coefficient computed between the source and destination vertex.\n",
    "<br>\n",
    "\n",
    "See the Jaccard notebook for additional information and background"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Additional Reading\n",
    "- [Similarity in graphs: Jaccard versus the Overlap Coefficient](https://medium.com/rapids-ai/similarity-in-graphs-jaccard-versus-the-overlap-coefficient-610e083b877d)\n",
    "- [Wikipedia: Overlap Coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### cuGraph Notice \n",
    "The current version of cuGraph has some limitations:\n",
    "\n",
    "* Vertex IDs need to be 32-bit integers.\n",
    "* Vertex IDs are expected to be contiguous integers starting from 0.\n",
    "\n",
    "cuGraph provides the renumber function to mitigate this problem. Input vertex IDs for the renumber function can be either 32-bit or 64-bit integers, can be non-contiguous, and can start from an arbitrary number. The renumber function maps the provided input vertex IDs to 32-bit contiguous integers starting from 0. cuGraph still requires the renumbered vertex IDs to be representable in 32-bit integers. These limitations are being addressed and will be fixed soon."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test Data\n",
    "We will be using the Zachary Karate club dataset \n",
    "*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of\n",
    "Anthropological Research 33, 452-473 (1977).*\n",
    "\n",
    "\n",
    "![Karate Club](../img/zachary_black_lines.png)\n",
    "\n",
    "This is a small graph which allows for easy visual inspection to validate results.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "# Let's get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Import needed libraries\n",
    "import cugraph\n",
    "import cudf\n",
    "from collections import OrderedDict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "### Define some Print functions\n",
    "(the `del` are not needed since going out of scope should free memory)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define a function for printing the top most similar vertices\n",
    "def print_most_similar_jaccard(df):\n",
    "    \n",
    "    jmax = df['jaccard_coeff'].max()\n",
    "    dm = df.query('jaccard_coeff >= @jmax')    \n",
    "    \n",
    "    #find the best\n",
    "    for i in range(len(dm)):    \n",
    "        print(\"Vertices \" + str(dm['source'].iloc[i]) + \" and \" + \n",
    "              str(dm['destination'].iloc[i]) + \" are most similar with score: \" \n",
    "              + str(dm['jaccard_coeff'].iloc[i]))\n",
    "    del jmax\n",
    "    del dm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define a function for printing the top most similar vertices\n",
    "def print_most_similar_overlap(df):\n",
    "    \n",
    "    smax = df['overlap_coeff'].max()\n",
    "    dm = df.query('overlap_coeff >= @smax and source < destination')      \n",
    "    \n",
    "    for i in range(len(dm)):\n",
    "        print(\"Vertices \" + str(dm['source'].iloc[i]) + \" and \" + \n",
    "          str(dm['destination'].iloc[i]) + \" are most similar with score: \" \n",
    "          + str(dm['overlap_coeff'].iloc[i]))\n",
    "        \n",
    "    del smax\n",
    "    del dm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define a function for printing jaccard similar vertices based on a threshold\n",
    "def print_jaccard_threshold(_d, limit):\n",
    "    \n",
    "    filtered = _d.query('jaccard_coeff > @limit')\n",
    "    \n",
    "    for i in range(len(filtered)):\n",
    "        print(\"Vertices \" + str(filtered['source'].iloc[i]) + \" and \" + \n",
    "            str(filtered['destination'].iloc[i]) + \" are similar with score: \" + \n",
    "            str(filtered['jaccard_coeff'].iloc[i]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define a function for printing similar vertices based on a threshold\n",
    "def print_overlap_threshold(_d, limit):\n",
    "    \n",
    "    filtered = _d.query('overlap_coeff > @limit')\n",
    "    \n",
    "    for i in range(len(filtered)):\n",
    "        if filtered['source'].iloc[i] != filtered['destination'].iloc[i] :\n",
    "            print(\"Vertices \" + str(filtered['source'].iloc[i]) + \" and \" + \n",
    "                str(filtered['destination'].iloc[i]) + \" are similar with score: \" + \n",
    "                str(filtered['overlap_coeff'].iloc[i]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read the CSV datafile using cuDF\n",
    "data file is actually _tab_ separated, so we need to set the delimiter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test file  \n",
    "datafile='../data/karate-data.csv'\n",
    "\n",
    "gdf = cudf.read_csv(datafile, delimiter='\\t', names=['src', 'dst'], dtype=['int32', 'int32'] )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's look at the DataFrame. There should be two columns and 156 records\n",
    "gdf.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Look at the first few data records - the output should be two colums src and dst\n",
    "gdf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a Graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a Graph \n",
    "G = cugraph.Graph()\n",
    "G.from_cudf_edgelist(gdf, source='src', destination='dst')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How many vertices are in the graph?  Remember that Graph is zero based\n",
    "G.number_of_vertices()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_The test graph has only 34 vertices, so why is the Graph listing 35?_\n",
    "\n",
    "As mentioned above, cuGraph vertex numbering is zero-based, meaning that the first vertex ID starts at zero.  The test dataset is 1-based.  Because of that, the Graph object adds an extra isolated vertex with an ID of zero.  Hence the difference in vertex count.  \n",
    "We could have run _renumbering_ on the data, or updated the value of each element _gdf['src'] = gdf['src'] - 1_    \n",
    "for now, we will just state that vertex 0 is not part of the dataset and can be ignored"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--- \n",
    "# Jaccard "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#%%time\n",
    "# Call cugraph.nvJaccard \n",
    "jdf = cugraph.jaccard(G)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Which two vertices are the most similar?\n",
    "print_most_similar_jaccard(jdf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Most similar shoul be 33 and 34.\n",
    "Vertex 33 has 12 neighbors, vertex 34 has 17 neighbors.  They share 10 neighbors in common:\n",
    "$jaccard = 10 / (10 + (12 -10) + (17-10)) = 10 / 19 = 0.526$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### let's look at all similarities over a threshold\n",
    "print_jaccard_threshold(jdf, 0.4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Since it is a small graph we can print all scores, notice that only vertices that are neighbors are being compared\n",
    "#\n",
    "# Before printing, let's get rid of the duplicates (x compared to y is the same as y compared to x).  We will do that\n",
    "# by performing a query.  Then let's sort the data by score\n",
    "\n",
    "jdf_s = jdf.query('source < destination').sort_values(by='jaccard_coeff', ascending=False)\n",
    "\n",
    "print_jaccard_threshold(jdf_s, 0.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "# Overlap Coefficient"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Noticed that the Jaccard score is based on the number of common items over the combined (union) set of items.  That makes sense when the two sets being compared are relativcely close in size.  However, when one set is considerable larger, then it is important to know if one set is a proper subset of the other <br>\n",
    "See:  [Similarity in graphs: Jaccard versus the Overlap Coefficient](https://medium.com/rapids-ai/similarity-in-graphs-jaccard-versus-the-overlap-coefficient-610e083b877d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#%%time\n",
    "# Call cugraph.nvJaccard \n",
    "odf = cugraph.overlap(G)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print the top similar pair - this function include code to drop duplicates  \n",
    "print_most_similar_overlap(odf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print all similarities over a threshold, in this case 0.5\n",
    "#also, drop duplicates\n",
    "odf_s = odf.query('source < destination').sort_values(by='overlap_coeff', ascending=False)\n",
    "\n",
    "print_overlap_threshold(odf_s, 0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Expanding similarity scoring to 2-hop vertex pair"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get all two-hop vertex pairs\n",
    "p = G.get_two_hop_neighbors()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's look at the Jaccard score\n",
    "ol2 = cugraph.overlap(G, vertex_pair=p)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print_most_similar_overlap(odf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print all similarities over a threshold, in this case 0.5\n",
    "#also, drop duplicates\n",
    "odf_s2 = ol2.query('source < destination').sort_values(by='overlap_coeff', ascending=False)\n",
    "\n",
    "print_overlap_threshold(odf_s2, 0.74)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's now compare the Overlap Coefficient with the Jaccard Similarity "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Call cugraph.nvJaccard \n",
    "jdf = cugraph.jaccard(G)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Which two vertices are the most similar?\n",
    "print_most_similar_jaccard(jdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's combine the Jaccard and Overlap scores\n",
    "mdf = jdf.merge(odf, on=['source','destination'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Also want to include the vertex degree\n",
    "degree = G.degree()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dS = degree.rename(columns={'vertex':'source','degree': 'src_degree'})\n",
    "dD = degree.rename(columns={'vertex':'destination','degree': 'dst_degree'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = mdf.merge(dS, how=\"left\", on='source')\n",
    "m = m.merge(dD, how=\"left\", on='destination')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "m.query('source < destination').sort_values(by='jaccard_coeff', ascending=False).head(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now sort on the overlap\n",
    "m.query('source < destination').sort_values(by='overlap_coeff', ascending=False).head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "Copyright (c) 2019-2020, NVIDIA CORPORATION.\n",
    "\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n",
    "___"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cugraph_dev",
   "language": "python",
   "name": "cugraph_dev"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
